Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with \textbf{S}patio-\textbf{T}emporal \textbf{C}ollaboration for instance \textbf{Seg}mentation in videos, namely \textbf{STC-Seg}. Concretely, STC-Seg makes four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly uses bounding-box diagonal points and spatio-temporal discrepancy to model movements, which largely improves robustness to varying object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it only scratches the surface of the innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
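To make the tracking idea concrete, here is a minimal sketch of associating detections across frames by the displacement of bounding-box diagonal points. It is an illustration only, not the authors' implementation: the greedy matching, the `max_cost` gate, and all names are our assumptions, and the paper's spatio-temporal discrepancy term is omitted.

```python
import math

def corner_cost(box_a, box_b):
    """Cost between two boxes given as (x1, y1, x2, y2): mean Euclidean
    displacement of the top-left and bottom-right diagonal corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    d_tl = math.hypot(ax1 - bx1, ay1 - by1)  # top-left corner shift
    d_br = math.hypot(ax2 - bx2, ay2 - by2)  # bottom-right corner shift
    return (d_tl + d_br) / 2.0

def greedy_match(prev_boxes, curr_boxes, max_cost=50.0):
    """Greedily associate current detections with previous tracks,
    cheapest corner displacement first."""
    pairs = sorted(
        (corner_cost(p, c), i, j)
        for i, p in enumerate(prev_boxes)
        for j, c in enumerate(curr_boxes)
    )
    used_prev, matches = set(), {}
    for cost, i, j in pairs:
        if cost <= max_cost and i not in used_prev and j not in matches:
            matches[j] = i  # current detection j continues track i
            used_prev.add(i)
    return matches
```

Because the cost is purely geometric, it stays stable under appearance changes, which is the intuition behind the robustness claim above.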
translated by Google Translate
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and task the participants with designing an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating rates of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
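The INT8 constraint above relies on affine quantization, which maps each real-valued activation or weight to an 8-bit code via a scale and a zero point. The toy functions below show the standard scheme in pure Python; they are a pedagogical sketch, not part of any challenge submission.

```python
def quantize_int8(x, scale, zero_point):
    """Affine quantization: real value -> int8 code clamped to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize_int8(q, scale, zero_point):
    """Approximate real value recovered from the int8 code."""
    return (q - zero_point) * scale
```

The round trip loses at most half a quantization step, which is why a well-chosen per-tensor (or per-channel) scale is critical for preserving super-resolution quality at INT8 precision.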
Recent advances in neural approaches have greatly improved task-oriented dialogue (TOD) systems, which assist users in accomplishing their goals. However, such systems rely on costly, manually labeled dialogs, which are not available in practical scenarios. In this paper, we present our models for Track 2 of the SereTOD 2022 challenge, which is the first challenge on building semi-supervised and reinforced TOD systems on MobileCS, a large-scale real-world Chinese TOD dataset. We build a knowledge-grounded dialog model that formulates the dialog history and the local KB as input and predicts the system response. We also perform semi-supervised pre-training on both the labeled and unlabeled data. Our system achieves first place in both the automatic evaluation and the human interaction, with notably higher BLEU (+7.64) and Success (+13.6\%) than the second place.
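A knowledge-grounded model of this kind typically flattens the dialog history and the local KB into a single input string. The sketch below shows one plausible serialization; the `[KB]`/`[DIALOG]`/`[SYSTEM]` tags and the field layout are our illustrative assumptions, not the SereTOD format.

```python
def build_input(history, local_kb):
    """Flatten dialog history plus a local KB into one input string for a
    knowledge-grounded text-to-text model. Tag names are illustrative."""
    kb_part = " ; ".join(f"{k} = {v}" for k, v in sorted(local_kb.items()))
    turns = " ".join(f"[{speaker}] {utt}" for speaker, utt in history)
    return f"[KB] {kb_part} [DIALOG] {turns} [SYSTEM]"
```

The model is then trained to continue the string after `[SYSTEM]` with the system response, so both labeled and unlabeled dialogs can share the same input pipeline.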
Traditional intent classification models are based on a predefined intent set and can only recognize a limited number of in-domain (IND) intent classes. However, users may input out-of-domain (OOD) queries in a practical dialogue system, and such OOD queries can indicate directions for future improvement. In this paper, we define a new task, Generalized Intent Discovery (GID), which aims to extend an IND intent classifier to an open-world intent set that includes both IND and OOD intents. We aim to classify a set of labeled IND intent classes while simultaneously discovering and recognizing new, unlabeled OOD types. We construct three public datasets for different application scenarios and propose two kinds of frameworks, pipeline-based and end-to-end, for future work. In addition, we conduct exhaustive experiments and a qualitative analysis to understand the key challenges and provide new guidance for future GID research.
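The first step in any GID pipeline is deciding whether a query belongs to a known IND class or should be routed into the OOD pool for discovery. A common baseline for this gate is thresholding the maximum softmax probability; the sketch below shows that baseline only, with a threshold we picked arbitrarily, and is not the paper's proposed framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_query(logits, ind_labels, threshold=0.7):
    """Keep the IND label when the classifier is confident; otherwise
    flag the query as OOD for later clustering/discovery."""
    probs = softmax(logits)
    p = max(probs)
    return ind_labels[probs.index(p)] if p >= threshold else "OOD"
```

Queries routed to `"OOD"` would then be clustered into new intent types and merged back into the label set, which is what GID evaluates end to end.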
Structural bias has recently been exploited for aspect sentiment triplet extraction (ASTE) and has improved performance. On the other hand, it is recognized that explicitly incorporating structural bias has a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structure. Thus, a natural question arises: is structural bias still necessary in the context of PLMs? To answer this question, we propose to address the efficiency issue by integrating the structural bias into PLMs via adapters, and by using a cheap-to-compute relative-position structure in place of the syntactic dependency structure. Benchmark evaluation is conducted on the SemEval datasets. The results show that our proposed structural adapter is beneficial to PLMs and achieves state-of-the-art performance over a range of strong baselines, yet with a light parameter demand and low latency. Meanwhile, we raise the concern that the current evaluation defaults to small-scale data, which is insufficient. We therefore release a large-scale dataset for ASTE. The results on the new dataset suggest that the structural adapter is reliably effective and efficient at scale. Overall, we conclude that structural bias is still necessary even with PLMs.
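The relative-position structure mentioned above can be computed in closed form from token indices alone, which is what makes it so much cheaper than running a dependency parser. A minimal sketch, assuming a simple symmetric clipping scheme (the actual bucketing in the paper may differ):

```python
def relative_position_matrix(n, max_dist=3):
    """Clipped relative positions between n tokens: entry (i, j) is
    j - i clipped to [-max_dist, max_dist]. An adapter can embed these
    integers as a structural signal in place of a parsed dependency tree."""
    return [[max(-max_dist, min(max_dist, j - i)) for j in range(n)]
            for i in range(n)]
```

Each entry indexes a small learned embedding table inside the adapter, so the structural signal costs O(n^2) integer operations rather than a parsing pass.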
Dialogue bots have been widely applied in customer service scenarios to provide a timely and user-friendly experience. Such bots must classify the appropriate domain of a dialogue, understand the user's intent, and generate proper responses. Existing dialogue pre-training models are designed only for specific dialogue tasks and ignore the weakly supervised expert knowledge in customer service dialogues. In this paper, we propose a novel unified knowledge-prompted pre-training framework, UFA (\textbf{U}nified Model \textbf{F}or \textbf{A}ll tasks), for customer service dialogues. We formulate all tasks of customer service dialogues as a unified text-to-text generation task and introduce a knowledge-driven prompt strategy to jointly learn from the different dialogue tasks. We pre-train UFA on a large-scale Chinese customer service corpus collected from practical scenarios and obtain significant improvements on both natural language understanding (NLU) and natural language generation (NLG) benchmarks.
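Casting every subtask as text-to-text generation usually means prepending a task-specific prompt to a shared input, so one model serves all tasks. The sketch below illustrates this pattern only; the prompt wording and task names are our guesses, not UFA's actual prompts.

```python
# Task prompts are illustrative placeholders, not UFA's actual prompt text.
TASK_PROMPTS = {
    "domain": "Classify the dialogue domain:",
    "intent": "Identify the user's intent:",
    "response": "Generate the system response:",
}

def cast_to_text2text(task, dialog_text):
    """Cast a customer-service subtask as text-to-text generation by
    prepending its prompt to the shared dialog input."""
    return f"{TASK_PROMPTS[task]} {dialog_text}"
```

Because every task shares one input/output format, labeled and weakly labeled examples from different tasks can be mixed freely in the same pre-training batches.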
Pre-trained language models (PLMs) have achieved remarkable performance gains on many downstream tasks in natural language understanding. Various Chinese PLMs have been proposed to learn better Chinese representations. However, most current models use Chinese characters as inputs and cannot encode the semantic information contained in Chinese words. While recent pre-trained models incorporate both words and characters, they usually suffer from insufficient semantic interaction and fail to capture the semantic relation between words and characters. To address these issues, we propose a simple yet effective PLM, CLOWER, which adopts contrastive learning over word and character representations. In particular, CLOWER implicitly encodes the coarse-grained information (i.e., words) into the fine-grained representations (i.e., characters) through contrastive learning on multi-grained information. CLOWER is of great value in realistic scenarios, since it can easily be incorporated into any existing fine-grained PLM without modifying the production pipeline. Extensive experiments conducted on a range of downstream tasks demonstrate the superior performance of CLOWER over several state-of-the-art baselines.
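The contrastive objective behind this kind of multi-grained alignment is typically an InfoNCE loss: pull an anchor representation toward its positive and away from negatives. The sketch below shows the generic loss on toy vectors; treating the character embedding as the anchor and its containing word's embedding as the positive is our reading of the setup, not CLOWER's exact formulation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE contrastive loss: -log softmax score of the positive
    among {positive} + negatives, with temperature scaling."""
    logits = [dot(anchor, positive) / temperature] + \
             [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[0]
```

A loss near zero means the anchor already sits closest to its positive; minimizing it over a corpus is what injects word-level semantics into the character embeddings.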
We propose UWash, an intelligent solution on smartwatches to assess handwashing, aiming to raise users' awareness of and cultivate habits of high-quality handwashing. UWash can identify the onset/offset of handwashing, measure the duration of each gesture, and score each gesture as well as the entire procedure in accordance with the WHO guidelines. Technically, we frame the task of handwashing assessment as a semantic segmentation problem in computer vision, and propose a lightweight UNet-like network, of only 496 KB, to achieve it effectively. Experiments over 51 subjects show that UWash achieves an accuracy of 92.27\% on sample-wise handwashing gesture recognition, an error of $<$0.5 \textit{seconds} in onset/offset detection, and an error of $<$5 out of 100 \textit{points} in scoring under the user-dependent setting, while remaining promising in the cross-user evaluation and the cross-user-cross-location evaluation.
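Framing the task as semantic segmentation means the network labels every sensor sample with a gesture class, after which onset/offset and per-gesture durations fall out of the label sequence directly. The sketch below shows that post-processing step on a toy label sequence; the label encoding (0 = background) is our assumption.

```python
from itertools import groupby

def onset_offset(labels, background=0):
    """First and last sample indices carrying any washing-gesture label."""
    active = [i for i, y in enumerate(labels) if y != background]
    return (active[0], active[-1]) if active else None

def gesture_durations(labels, background=0):
    """(gesture, length-in-samples) for each contiguous gesture run."""
    return [(g, sum(1 for _ in grp))
            for g, grp in groupby(labels) if g != background]
```

Dividing each run length by the sensor sampling rate converts durations to seconds, which is what the scoring against the WHO guideline steps would consume.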
With the flourishing of pre-trained Transformers, remarkable progress has been made in text-pair modeling to support relevant natural language applications. Two lines of approaches have been developed for text matching: interaction-based models, which perform full interaction across the text pair, and representation-based models, which encode the pair independently with Siamese encoders. The former achieve compelling performance due to their deep interaction-modeling ability, but at the sacrifice of inference latency. The latter are efficient and widely adopted in practice, yet suffer from severe performance degradation due to the absence of interaction. Although some prior works attempt to integrate interactive knowledge into representation-based models, considering the computational cost they only perform late interaction or transfer knowledge at the top layer. Interactive information in the lower layers is still missing, which limits the performance of representation-based solutions. To remedy this, we propose a novel \textit{virtual} interaction mechanism that enables full and deep interaction modeling in representation-based models without \textit{actual} inference-time computation. Concretely, our approach asks the representation-based encoders to conduct virtual interactions that mimic the behavior of interaction-based models. In addition, the knowledge distilled from interaction-based encoders is taken as supervision signals to guarantee the effectiveness of the virtual interaction. Since the virtual interaction only happens during the training stage, our approach does not increase the inference cost. Furthermore, we design an adapted late-interaction strategy to fully utilize the learned virtual-interaction knowledge.
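For context on the "late interaction at the top layer" baseline the paragraph contrasts against, here is a minimal MaxSim-style scorer on toy token vectors. This illustrates the generic late-interaction idea (as in ColBERT-style retrieval), not this paper's adapted strategy.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim late interaction: each query token vector matches its
    best-scoring document token vector; per-token maxima are summed.
    Both sides were encoded independently, so this cheap matching is
    the only cross-text computation at inference time."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

The limitation the paper targets is visible here: the two encoders never see each other's tokens below this final scoring step, which is exactly the lower-layer interaction the virtual mechanism tries to recover during training.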
Remote photoplethysmography (rPPG) enables non-contact heart rate (HR) estimation from facial videos, offering significant convenience compared with traditional contact-based measurements. In real-world long-term health monitoring scenarios, the distance of the participants and their head movements usually vary over time, resulting in inaccurate rPPG measurements due to the varying face resolution and complex motion artifacts. Unlike previous rPPG models designed for a constant distance between the camera and the participants, in this paper we propose two plug-and-play blocks (i.e., a physiological signal feature extraction block (PFE) and a temporal face alignment block (TFA)) to alleviate the degradation caused by changing distance and head motion. On the one hand, guided by representative-area information, PFE adaptively encodes arbitrary-resolution facial frames into fixed-resolution facial structure features. On the other hand, leveraging the estimated optical flow, TFA is able to counteract the rPPG signal confusion caused by head movement and thus benefits motion-robust rPPG signal recovery. Besides, we also train the model with a cross-resolution constraint using a two-stream dual-resolution framework, which further helps PFE learn resolution-robust facial rPPG features. Extensive experiments on three benchmark datasets (UBFC-rPPG, COHFACE and PURE) demonstrate the superior performance of the proposed method. One highlight is that with PFE and TFA, off-the-shelf spatio-temporal rPPG models can predict more robust rPPG signals under both varying face resolution and severe head movement scenarios. The codes are available at https://github.com/LJW-GIT/Arbitrary_Resolution_rPPG.
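Once a clean rPPG trace is recovered, HR estimation reduces to finding the dominant frequency of the signal and converting it to beats per minute. The sketch below uses a naive DFT for self-containedness; it is a generic post-processing step, not part of the PFE/TFA method itself.

```python
import math

def dominant_frequency(signal, fs):
    """Dominant frequency (Hz) of a real signal via a naive DFT scan
    over the positive-frequency bins."""
    n = len(signal)
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * t / n)
                 for t, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * k * t / n)
                 for t, x in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs / n

def heart_rate_bpm(signal, fs):
    """Heart rate in beats per minute from an rPPG trace sampled at fs Hz."""
    return 60.0 * dominant_frequency(signal, fs)
```

In practice one would band-limit the search to the physiological range (roughly 0.7 to 4 Hz) and use an FFT, but the principle is the same.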